REDUCING VARIANCE IN UNIVARIATE SMOOTHING

By Ming-Yen Cheng, Liang Peng and Jyh-Shyang Wu

National Taiwan University, Georgia Institute of Technology 
and Tamkang University 

A variance reduction technique in nonparametric smoothing is 
proposed: at each point of estimation, form a linear combination of a
preliminary estimator evaluated at nearby points with the coefficients
specified so that the asymptotic bias remains unchanged. The nearby 
points are chosen to maximize the variance reduction. We study in 
detail the case of univariate local linear regression. While the new 
estimator retains many advantages of the local linear estimator, it has 
appealing asymptotic relative efficiencies. Bandwidth selection rules 
are available by a simple constant factor adjustment of those for local 
linear estimation. A simulation study indicates that the finite sample 
relative efficiency often matches the asymptotic relative efficiency for 
moderate sample sizes. This technique is very general and has a wide 
range of applications. 

1. Introduction. Local linear modeling for nonparametric regression has 
many advantages and has become very popular. Hastie and Loader [16], 
Wand and Jones [27], Fan and Gijbels [11] and others have investigated
extensively its theoretical and practical properties. To reduce the variance,
for each point x we take a special linear combination of the local linear
estimators at three equally spaced points around x. The linear combination satisfies
certain moment conditions so that the asymptotic bias is unchanged. Below 
are a few specific features of this new estimator. First, both local and global 
automatic bandwidths can be easily obtained from those for the standard lo- 
cal linear estimator. Second, the asymptotic mean squared error is improved 
considerably and the amount of reduction is uniform across different loca- 
tions, regression functions, designs and error distributions. Third, evidenced 
by a simulation study, the reduction in asymptotic variance is effectively 
projected to finite samples. Fourth, the estimators admit simple forms and
only slightly increase the amount of computation time. Finally, many ad- 
vantages of the local linear estimator, for example, design adaptivity and 
automatic boundary correction, are retained. 

In kernel density estimation, Kogure [18] studied a related topic on poly- 
nomial interpolation and obtained some preliminary results. The purpose 
is to find optimal allocation of the interpolation points by minimizing the 
asymptotic mean integrated squared error. Higher-order polynomial inter- 
polation does not change the asymptotic bias but alters the asymptotic 
variance in a nonhomogeneous way unless the spacings are all large enough, 
which was assumed in seeking the optimal allocation. 

Fan [9] showed that the local linear estimator is minimax optimal among 
all linear estimators. The proposed estimators are linear and have smaller 
asymptotic mean squared errors than the local linear estimator. There is no 
conflict between these two results since the assumptions are different. In [9], 
the maximum risk of a linear estimator is taken over a class of regression 
functions that can be approximated well linearly. The asymptotic results 
for our estimators require the slightly more stringent condition that the 
regression function has a bounded, continuous second derivative at the point 
of estimation. 

There is an extensive literature on modifications and improvements of 
kernel and local linear estimators. Many of them are aimed at bias reduction, 
for example, Abramson [1], Samiuddin and El-Sayyad [22], Jones, Linton 
and Nielsen [17] and Choi and Hall [7]. Choi and Hall [7] exploit the idea of 
taking linear combinations of some local linear estimators as well. The key 
difference is that we take linear combinations to maintain bias but reduce 
variance while Choi and Hall [7] take linear combinations to reduce bias 
while essentially maintaining variance. More specifically, (i) they use local 
linear estimates at points symmetrically located around x and we do not, 
(ii) they take a convex combination and we employ different constraints on 
the coefficients, (iii) bandwidth selection is straightforward in our case but 
is much more complicated for theirs, and (iv) our method applies to local 
constant fitting estimators but theirs does not. There are very few variance 
reduction techniques available. An error-dependent technique in local linear 
smoothing was suggested by Cheng and Hall [3]. The amount of variance 
reduction depends on the error distribution. Cheng and Hall [4] introduced 
an adaptive line integral of the bivariate kernel density estimate to reduce 
variance. It does not apply to the univariate case or regression estimation. 
These two methods require explicitly or implicitly a pilot estimation of the 
function or its derivatives. Our procedure is based on a completely different 
idea and effectively reduces variance without pilot estimation. 

Section 2 presents the methodology and Section 3 provides results on the 
asymptotic mean squared error and coverage accuracy. Section 4 discusses 
some practical issues, including bandwidth selection and implementation.
Section 5 contains a numerical study. Possible generalizations and applica- 
tions of the variance reduction technique are discussed in Section 6. Proofs 
are given in Section 7. 

2. Methodology. 

2.1. Local linear regression. Suppose that independent bivariate
observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ are sampled from the regression model

$$Y = m(X) + \sigma(X)\varepsilon,$$

where $\sigma^2(X)$ is the conditional variance and the random error
$\varepsilon$, independent of $X$, has zero mean and unit variance. Kernel
estimators of $m(x) = E(Y \mid X = x)$ include the Nadaraya-Watson,
Gasser-Müller and local polynomial estimators; see [8, 25, 27]. Let $K$ be a
kernel function and $h > 0$ be a bandwidth. The local linear regressor is
defined as

$$\hat m(x) = \frac{S_{n,2}(x)T_{n,0}(x) - S_{n,1}(x)T_{n,1}(x)}{S_{n,0}(x)S_{n,2}(x) - S_{n,1}(x)S_{n,1}(x)}, \tag{2.1}$$

where $S_{n,l}(x) = h\sum_{i=1}^n (x - X_i)^l K_h(x - X_i)$, $l = 0, 1, 2$,
$T_{n,l}(x) = h\sum_{i=1}^n (x - X_i)^l K_h(x - X_i)Y_i$, $l = 0, 1$, and
$K_h(t) = K(t/h)/h$. This estimator is obtained
by solving a local linear least squares problem. Its appealing theoretical and
numerical properties were discussed by Fan [9] and Fan and Gijbels [11],
among others. Further, it has become very popular, widely used in
applications and implemented in statistical software. When the denominator is
close to zero, it exhibits a rather unstable numerical behavior. This is
particularly problematic for small samples or sparse designs. Remedies include
modifications proposed by Seifert and Gasser [23], Cheng, Hall and
Titterington [5], Hall and Turlach [15] and Seifert and Gasser [24]. Although we
focus on the local linear estimator denoted by $\hat m(x)$, our variance reduction
methods given below can be applied to any such modification.

2.2. Variance reduced estimators. A motivation of our variance
reduction strategy is to incorporate more data points in the regression estimation
in such a way that the first-order term in the asymptotic bias remains
unchanged. To do this, at each x we construct a linear combination of the local
linear estimators at some equally spaced points near x, with the linear
coefficients satisfying certain moment conditions derived from the asymptotic
bias expansions. Then variance reduction is achieved since the three
preliminary estimators are correlated and the correlation coefficients are less than
1. In this context, we fix the number of nearby points at three since that
is the minimal requirement specified by the moment conditions and using
more than three yields complex solutions. In addition, the grid points are
taken so that the final estimator is the simplest and most efficient in the 
mean squared error sense. 

Formally, for any $x$, define three equally spaced points
$\alpha_{x,j} = x - (r + 1 - j)\omega_n$, $j = 0, 1, 2$, where $r \in (-1, 1)$
and $\omega_n = \delta h$ for some constant $\delta > 0$. The
shift parameter $r$ determines the location of the leftmost point
$\alpha_{x,0}$ relative to $x$ and $\omega_n$ is the spacing of the grid.
Construct a quadratic interpolation of the local linear estimators
$\hat m(\alpha_{x,0})$, $\hat m(\alpha_{x,1})$ and $\hat m(\alpha_{x,2})$ and
then estimate $m(x)$ by the value of the interpolated curve at $x$:

$$\hat m_q(x) = \sum_{j=0,1,2} A_j(r)\,\hat m\{x - (r + 1 - j)\omega_n\},$$

where the coefficients depend only on $r$ and are given by

$$A_0(r) = r(r - 1)/2, \qquad A_1(r) = 1 - r^2, \qquad A_2(r) = r(r + 1)/2.$$

Then the moment conditions (7.4) are satisfied so that $\hat m_q(x)$ has the same
asymptotic bias as $\hat m(x)$. Theorem 1 and Proposition 1 show that
$\hat m_q(x)$ has a smaller asymptotic variance than $\hat m(x)$ and the
asymptotic variances differ from each other by a constant factor depending on
the bin-width parameter $\delta$, the shift parameter $r$ and the kernel $K$.
Given $K$ and $\delta$, the optimal values of $r$ that minimize this constant
factor are $r = \pm\sqrt{1/2}$, which give

$$\hat m_\pm(x) = \sum_{j=0,1,2} A_j(\pm 1/\sqrt{2})\,\hat m\{x - (\pm 1/\sqrt{2} + 1 - j)\omega_n\}.$$

Since $\hat m_+(x)$ uses more data information on the left-hand side of $x$ than
on the right-hand side, the curve estimate $\hat m_+(\cdot)$ tends to shift to
the right of $m(\cdot)$. Similarly, $\hat m_-(\cdot)$ tends to shift to the
left of $m(\cdot)$. This symmetry suggests taking the average

$$\hat m_a(x) = \{\hat m_+(x) + \hat m_-(x)\}/2$$

as our final estimator. Sections 3 and 5 demonstrate that, compared to
$\hat m_\pm(x)$, $\hat m_a(x)$ further improves the asymptotic and finite
sample efficiencies.

When Supp(m) is bounded, Supp(m) = [0, 1] say, since each of $\hat m_\pm(x)$ and
$\hat m_a(x)$ requires values of $\hat m$ at points around $x$, a $\delta$ value,

$$\delta(x) = \min\bigl\{\delta,\ x/[(\sqrt{1/2} + 1)h],\ (1 - x)/[(\sqrt{1/2} + 1)h]\bigr\}, \tag{2.2}$$

which depends on the distances from $x$ to the boundary points 0 and 1, is
used so that $\hat m_a(x)$ is defined for every $x \in [0, 1]$.
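
The construction above can be sketched in a few lines of code. The following is only an illustrative sketch, not the authors' implementation: the function names (`local_linear`, `coefficients`, `reduced`, `boundary_delta`, `averaged`) are hypothetical, the Epanechnikov kernel is an arbitrary choice, and the small $n^{-2}$ term in the denominator is borrowed from the modified estimator (3.1) so that the ratio is always defined.

```python
import math

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def local_linear(x, xs, ys, h, kernel=epanechnikov):
    """Local linear estimate at x, written as in (2.1).

    Assumption: n^{-2} is added to the denominator, as in (3.1),
    so that the ratio stays defined when few points fall near x.
    """
    n = len(xs)
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(xs, ys):
        d = x - xi
        w = kernel(d / h)          # equals h * K_h(d)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1 + n ** -2)

def coefficients(r):
    """Quadratic interpolation weights A_0(r), A_1(r), A_2(r)."""
    return (r * (r - 1) / 2, 1 - r * r, r * (r + 1) / 2)

def reduced(x, xs, ys, h, delta, r, kernel=epanechnikov):
    """Combine local linear fits at x - (r + 1 - j) * delta * h, j = 0, 1, 2."""
    omega = delta * h
    return sum(a * local_linear(x - (r + 1 - j) * omega, xs, ys, h, kernel)
               for j, a in enumerate(coefficients(r)))

def boundary_delta(x, h, delta):
    """Bin-width parameter (2.2) for a design supported on [0, 1]."""
    c = (math.sqrt(0.5) + 1.0) * h
    return min(delta, x / c, (1.0 - x) / c)

def averaged(x, xs, ys, h, delta=1.0, kernel=epanechnikov):
    """Average of the two estimators with r = +1/sqrt(2) and r = -1/sqrt(2)."""
    d = boundary_delta(x, h, delta)
    r = math.sqrt(0.5)
    return 0.5 * (reduced(x, xs, ys, h, d, r, kernel)
                  + reduced(x, xs, ys, h, d, -r, kernel))
```

Because the weights sum to one and the grid offsets have zero first moment under these weights, a sketch of this kind reproduces exactly linear data up to the small $n^{-2}$ perturbation, which is a convenient sanity check before applying it to noisy samples.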

Our estimators $\hat m_\pm(x)$ and $\hat m_a(x)$ are simple linear combinations of local
linear estimators evaluated at nearby points. No pilot estimation is involved
in this variance reduction. Therefore, in finite samples, the asymptotic
efficiencies are achieved at relatively small n. For the same reasons, they share
many advantages enjoyed by the local linear estimator, for example
automatic boundary correction and design adaptivity; see [11]. It is shown in
Section 3 that each of $\hat m_\pm(x)$ and $\hat m_a(x)$ reduces the asymptotic variance
uniformly across different regression functions, different designs and
different error distributions. Then procedures such as design planning employed
in applications of $\hat m$ apply to our estimators in general. Thus the proposed
estimators are rather user friendly.

2.3. Confidence intervals. Consider constructing confidence intervals for
$m(x)$ based on $\hat m(x)$ and $\hat m_q(x)$. Let
$\nu_{ij} = \int s^i K(s)^j\,ds$ and
$\tilde\nu_{02} = \int \{\sum_{i=0,1,2} A_i(r)K(s + i\delta)\}^2\,ds$. Define
$w_{ijk}(x) = n^{-1}h^{j-i-1}\sum_{l=1}^n (x - X_l)^i\{K_h(x - X_l)\}^j\{Y_l - m(x)\}^k$
and
$\hat\sigma^2(x) = n^{-1}\sum_{i=1}^n K_h(x - X_i)\{Y_i - \hat m(x)\}^2 / w_{010}(x)$.
Then

$$\frac{(nh)^{1/2}\{\hat m(x) - m(x)\}}{\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}\nu_{02}^{1/2}} \qquad\text{and}\qquad \frac{(nh)^{1/2}\{\hat m_q(x) - m(x)\}}{\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}\tilde\nu_{02}^{1/2}}$$

both are asymptotically $N(0, 1)$ distributed when $nh^5 \to 0$, and we have the
following one-sided confidence intervals for $m(x)$:

$$I_\beta = \bigl(\hat m(x) - z_\beta\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}\nu_{02}^{1/2}(nh)^{-1/2},\ \infty\bigr), \tag{2.3}$$

$$\tilde I_\beta = \bigl(\hat m_q(x) - z_\beta\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}\tilde\nu_{02}^{1/2}(nh)^{-1/2},\ \infty\bigr), \tag{2.4}$$

where $z_\beta$ satisfies $P\{N(0, 1) \le z_\beta\} = \beta$.

3. Theoretical results. The following conditions are needed for
asymptotic analysis of the estimators.

Assumption $(C_K)$. $K$ is a symmetric density function with compact
support.

Assumption $(C_{m,f})$.

1. $m''(\cdot)$, the second derivative of $m(\cdot)$, is bounded and continuous at $x$;

2. the density function $f(\cdot)$ of $X$ satisfies $f(x) > 0$ and
$|f(x) - f(y)| \le c|x - y|^\alpha$ for some $0 < \alpha < 1$;

3. $\sigma^2(\cdot)$ is bounded and continuous at $x$.

3.1. Asymptotic mean squared error. Fan [9] showed that, under the
conditions $(C_{m,f})$, $(C_K)$, $h \to 0$ and $nh \to \infty$ as $n \to \infty$,

$$\hat m(x) = \frac{S_{n,2}(x)T_{n,0}(x) - S_{n,1}(x)T_{n,1}(x)}{S_{n,0}(x)S_{n,2}(x) - S_{n,1}(x)S_{n,1}(x) + n^{-2}}, \tag{3.1}$$

a modification of (2.1) that admits asymptotic unconditional variance, has

$$E\{\hat m(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.2}$$

$$\operatorname{Var}\{\hat m(x)\} = \frac{\sigma^2(x)}{nhf(x)}\nu_{02} + o\{h^4 + (nh)^{-1}\}. \tag{3.3}$$

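Theorem 1. Suppose that $\delta > 0$ is a constant. Assume $(C_{m,f})$, $(C_K)$,
$h \to 0$ and $nh \to \infty$ as $n \to \infty$. Then

$$E\{\hat m_q(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.4}$$

$$\operatorname{Var}\{\hat m_q(x)\} = \frac{\sigma^2(x)}{nhf(x)}\tilde\nu_{02} + o\{h^4 + (nh)^{-1}\}. \tag{3.5}$$

Note that $\tilde\nu_{02} = \nu_{02} - r^2(1 - r^2)C(\delta)$, where

$$C(\delta) = 1.5\,C(0, \delta) - 2\,C(0.5, \delta) + 0.5\,C(1, \delta)$$

with $C(a, \delta) = \int K(t - a\delta)K(t + a\delta)\,dt$.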
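
The relation between $\tilde\nu_{02}$ and $C(\delta)$ can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses the Epanechnikov kernel, a plain midpoint rule with an arbitrary number of nodes, and hypothetical helper names (`midpoint`, `C_pair`, `C`, `nu02_tilde`). It evaluates $C(a, \delta)$ and $C(\delta)$ as defined above and compares $\nu_{02} - r^2(1 - r^2)C(\delta)$ with direct integration of $\{\sum_i A_i(r)K(s + i\delta)\}^2$.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def midpoint(g, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(g(lo + (i + 0.5) * step) for i in range(n))

def C_pair(a, delta, K=epanechnikov, support=1.0):
    """C(a, delta): integral of K(t - a*delta) * K(t + a*delta) over t."""
    shift = a * delta
    half = support + abs(shift)
    return midpoint(lambda t: K(t - shift) * K(t + shift), -half, half)

def C(delta, K=epanechnikov, support=1.0):
    """C(delta) = 1.5 C(0, delta) - 2 C(0.5, delta) + 0.5 C(1, delta)."""
    return (1.5 * C_pair(0.0, delta, K, support)
            - 2.0 * C_pair(0.5, delta, K, support)
            + 0.5 * C_pair(1.0, delta, K, support))

def nu02_tilde(r, delta, K=epanechnikov, support=1.0):
    """Direct integral of {sum_i A_i(r) K(s + i*delta)}^2 over s."""
    A = (r * (r - 1) / 2, 1 - r * r, r * (r + 1) / 2)
    half = support + 2.0 * delta
    return midpoint(
        lambda s: sum(A[i] * K(s + i * delta) for i in range(3)) ** 2,
        -half, half)

if __name__ == "__main__":
    r, delta = 0.5 ** 0.5, 1.0
    nu02 = C_pair(0.0, delta)
    print(nu02_tilde(r, delta), nu02 - r * r * (1 - r * r) * C(delta))
```

The `support` argument is the half-width of the kernel support; a kernel with unbounded support, such as the Normal, would need a larger value and more integration nodes.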

Proposition 1. The quantity $C(\delta)$ has the following properties:

(a) For any symmetric kernel function $K$, $C(\delta) \ge 0$ for any $\delta \ge 0$.

(b) If $K$ has a unique maximum and is concave, then $C(\delta)$ is increasing
in $\delta > 0$.



Remark 1. From (3.2) and (3.4), $\hat m_q(x)$ and $\hat m(x)$ have the same
asymptotic bias. From (3.3), (3.5) and Proposition 1, the asymptotic variance of
$\hat m_q(x)$ is smaller than that of $\hat m(x)$ by the amount
$\{nhf(x)\}^{-1}\sigma^2(x)r^2(1 - r^2)C(\delta)$. Note that
$0 < r^2(1 - r^2) \le 1/4$ for any $r \in (-1, 1) \setminus \{0\}$ and the
maximum 1/4 is attained at $r = \pm\sqrt{1/2}$. Therefore, for any $\delta > 0$, the optimal
choices of $r$ are $r = \pm\sqrt{1/2}$, which yield $\hat m_\pm(x)$.
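
The algebraic facts behind Remark 1 are easy to verify by direct arithmetic. The short sketch below (function names are hypothetical) checks that the coefficients $A_j(r)$ satisfy the moment conditions (7.4), that $\sum_j A_j(r)^2 = 1 - 1.5\,r^2(1 - r^2)$, which is the source of the factor 1.5 multiplying $C(0, \delta)$, and that $r^2(1 - r^2)$ peaks at $r = \pm 1/\sqrt{2}$.

```python
import math

def coefficients(r):
    """Interpolation weights A_0(r), A_1(r), A_2(r)."""
    return (r * (r - 1) / 2, 1 - r * r, r * (r + 1) / 2)

def moment(r, j):
    """Left-hand side of the moment condition (7.4) for power j."""
    a0, a1, a2 = coefficients(r)
    return a0 * (-1 - r) ** j + a1 * (-r) ** j + a2 * (1 - r) ** j

def variance_weight(r):
    """The factor r^2 (1 - r^2) that multiplies C(delta)."""
    return r * r * (1 - r * r)

if __name__ == "__main__":
    for r in (-0.9, -0.5, 0.25, math.sqrt(0.5)):
        print(r, [round(moment(r, j), 12) for j in range(3)],
              variance_weight(r))
```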

Remark 2. A generalization of $\hat m_q(x)$, based on local linear estimators
at $z_0 = x - r\delta h$, $z_1 = x - (r - k)\delta h$ and
$z_2 = x - (r - k - 1)\delta h$, for some $0 < k \ne 1$, is
$\sum_{j=0,1,2} B_j(r)\hat m(z_j)$, where $B_0(r) = r(r - 1)/\{k(k + 1)\}$,
$B_1(r) = -(r + k)(r - 1)/k$ and $B_2(r) = r(r + k)/(k + 1)$. It has the same
asymptotic bias as $\hat m(x)$ and asymptotic variance
$\{nhf(x)\}^{-1}\sigma^2(x)\tau(\delta, r, k)$, where

$$\tau(\delta, r, k) = \nu_{02}\sum_{j=0,1,2} B_j(r)^2 + 2B_0(r)B_1(r)C(k, \delta/2) + 2B_0(r)B_2(r)C(k + 1, \delta/2) + 2B_1(r)B_2(r)C(1, \delta/2).$$

In $\tau(\delta, r, k)$, $K$ interacts with $r$, $k$ and $\delta$ and there is
no explicit value of $r$ minimizing $\tau(\delta, r, k)$ for given $\delta$ and $k$.
Corollary 1. Under the conditions in Theorem 1, as $n \to \infty$,

$$E\{\hat m_\pm(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.6}$$

$$\operatorname{Var}\{\hat m_\pm(x)\} = \frac{\sigma^2(x)}{nhf(x)}\Bigl\{\nu_{02} - \frac{C(\delta)}{4}\Bigr\} + o\{h^4 + (nh)^{-1}\}. \tag{3.7}$$

Theorem 2 can be proved using arguments similar to the proof of Theorem 1.

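Theorem 2. Under the conditions in Theorem 1, as $n \to \infty$,

$$E\{\hat m_a(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.8}$$

$$\operatorname{Var}\{\hat m_a(x)\} = \frac{\sigma^2(x)}{nhf(x)}\Bigl\{\nu_{02} - \frac{C(\delta)}{4} - \frac{D(\delta)}{2}\Bigr\} + o\{h^4 + (nh)^{-1}\}, \tag{3.9}$$

where

$$\begin{aligned}
D(\delta) = \nu_{02} - \tfrac{1}{4}C(\delta) - \tfrac{1}{16}\bigl\{ & 4(1 + \sqrt{2})\,C(\sqrt{2} - 1, \delta/2) + (3 + 2\sqrt{2})\,C(2 - \sqrt{2}, \delta/2) \\
& + 2\,C(\sqrt{2}, \delta/2) + 4(1 - \sqrt{2})\,C(\sqrt{2} + 1, \delta/2) \\
& + (3 - 2\sqrt{2})\,C(\sqrt{2} + 2, \delta/2)\bigr\}.
\end{aligned}$$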

Proposition 2. The quantity $D(\delta)$ in (3.9) is nonnegative for any $\delta \ge 0$.

Remark 3. If $m^{(4)}(x)$ exists, then the second-order term in the bias of
$\hat m_a(x)$ is $O(h^4)$, smaller than those of $\hat m(x)$ and $\hat m_\pm(x)$.

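Remark 4. Suppose that $\operatorname{supp}(K) = [-1, 1]$. Then, for
$\delta \ge 2$, $C(\delta) = (3/2)\nu_{02}$ and

$$\operatorname{Var}\{\hat m_q(x)\} \approx \{1 - \tfrac{3}{2}r^2(1 - r^2)\}\operatorname{Var}\{\hat m(x)\}, \qquad \operatorname{Var}\{\hat m_\pm(x)\} \approx \tfrac{5}{8}\operatorname{Var}\{\hat m(x)\}. \tag{3.10}$$

For $\delta \ge 2/(\sqrt{2} - 1)$, $D(\delta) = (5/8)\nu_{02}$ and

$$\operatorname{Var}\{\hat m_a(x)\} \approx \tfrac{5}{16}\operatorname{Var}\{\hat m(x)\}. \tag{3.11}$$

If $K$ is infinitely supported, then (3.10) and (3.11) hold for sufficiently large
$\delta$.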
Fig. 1. AMSE relative efficiency. The left and right panels respectively plot $\gamma_q(\delta)$ and
$\gamma_a(\delta)$ against $\delta$ for the Uniform (solid), Epanechnikov (dotted) and Normal (dashed)
kernels.

Remark 5. If $\delta = o(1)$, then $\hat m_q(x)$, $\hat m_\pm(x)$ and
$\hat m_a(x)$ all have the same asymptotic bias and variance as $\hat m(x)$. If
$\delta \to \infty$ with $\delta = o(h^{-1/3})$ and $m'''(x)$ exists, then the
biases of $\hat m_q(x)$ and $\hat m_\pm(x)$ remain the same as in
(3.4) and (3.6) and the variances are as in (3.10). If $\delta \to \infty$ with
$\delta = o(h^{-1/2})$ and $m^{(4)}(x)$ exists, then the bias and variance of
$\hat m_a(x)$ are as in (3.8) and (3.11).



From Corollary 1, the pointwise (global) asymptotic efficiency, in terms
of asymptotically optimal (integrated) MSE, of $\hat m_\pm(x)$ relative to $\hat m(x)$ is

$$\gamma_q(\delta) = \{\nu_{02} - C(\delta)/4\}^{-4/5}\nu_{02}^{4/5}. \tag{3.12}$$

In addition, Theorem 2 implies that the pointwise or global asymptotic
relative efficiency of $\hat m_a(x)$ compared to $\hat m(x)$ is

$$\gamma_a(\delta) = \{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{-4/5}\nu_{02}^{4/5}. \tag{3.13}$$

Both $\gamma_q(\delta)$ and $\gamma_a(\delta)$ depend only on $K$ and $\delta$ and do not involve the
regression function $m$, the design density $f$ or the error distribution. Thus,
asymptotically, the variance reduction methods do not interfere with these
factors. Figure 1 plots $\gamma_q(\delta)$ and $\gamma_a(\delta)$ against $\delta$ when $K$ is the Uniform,
$K(u) = 0.5I(|u| \le 1)$, Epanechnikov, $K(u) = 0.75(1 - u^2)I(|u| \le 1)$ or
Normal, $K(u) = (2\pi)^{-1/2}\exp(-u^2/2)$, kernel.
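
As a rough numerical illustration of (3.12) and (3.13), the efficiencies can be evaluated for compactly supported kernels. The sketch below is written under stated assumptions: the helper names (`overlap`, `C`, `D`, `gamma_q`, `gamma_a`) are hypothetical, integration uses a simple midpoint rule, and the identity $C(a, \delta) = \int K(t)K(t + 2a\delta)\,dt$ (a change of variable) is used to express everything through one overlap integral.

```python
import math

SQRT2 = math.sqrt(2.0)

def uniform(u):
    """Uniform kernel on [-1, 1]."""
    return 0.5 if abs(u) <= 1.0 else 0.0

def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def overlap(K, s, support=1.0, n=20000):
    """Midpoint-rule value of the integral of K(t) * K(t + s) over t."""
    lo, hi = -support - abs(s), support + abs(s)
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * step
        total += K(t) * K(t + s)
    return step * total

def C(delta, K):
    """C(delta) of Theorem 1, using C(a, delta) = overlap(K, 2*a*delta)."""
    return (1.5 * overlap(K, 0.0) - 2.0 * overlap(K, delta)
            + 0.5 * overlap(K, 2.0 * delta))

def D(delta, K):
    """D(delta) of Theorem 2, using C(a, delta/2) = overlap(K, a*delta)."""
    terms = (4 * (1 + SQRT2) * overlap(K, (SQRT2 - 1) * delta)
             + (3 + 2 * SQRT2) * overlap(K, (2 - SQRT2) * delta)
             + 2 * overlap(K, SQRT2 * delta)
             + 4 * (1 - SQRT2) * overlap(K, (SQRT2 + 1) * delta)
             + (3 - 2 * SQRT2) * overlap(K, (SQRT2 + 2) * delta))
    return overlap(K, 0.0) - C(delta, K) / 4.0 - terms / 16.0

def gamma_q(delta, K):
    """Relative efficiency (3.12)."""
    nu02 = overlap(K, 0.0)
    return (nu02 / (nu02 - C(delta, K) / 4.0)) ** 0.8

def gamma_a(delta, K):
    """Relative efficiency (3.13)."""
    nu02 = overlap(K, 0.0)
    return (nu02 / (nu02 - C(delta, K) / 4.0 - D(delta, K) / 2.0)) ** 0.8

if __name__ == "__main__":
    for delta in (0.6, 0.8, 1.0, 1.2, 1.6):
        print(delta, gamma_q(delta, epanechnikov), gamma_a(delta, epanechnikov))
```

For the Uniform kernel and large $\delta$ this sketch should reproduce the limiting variance ratios 5/8 and 5/16 of Remark 4, that is, efficiencies $(8/5)^{4/5}$ and $(16/5)^{4/5}$.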

3.2. Coverage probability. Coverage probabilities of the confidence
intervals $I_\beta$ and $\tilde I_\beta$ for $m(x)$, given in (2.3) and (2.4) and constructed based on
$\hat m(x)$ and $\hat m_q(x)$, are analyzed and compared as follows.

Theorem 3. Assume $(C_K)$, $(C_{m,f})$, $h = o(n^{-1/5})$ and $nh/\log n \to \infty$
as $n \to \infty$. Then



P{m{x) e Ip] 



(3.14) 
= (3 + {nh'>f/H-^v2i^^^'^<y-\x)

- {nh)-^'^Q-^v~^'^a-\x)r^'\x)V^{x) 
X {^ozizp - 1) - 2,vl2zl}(l){zp) 

+ o{{i/V^+hf}, 

P{m{x) E Ip] 

= 13+ {nh'>fH-^V2iv^^'^a-Hx) 
xf^l\x)m"{x){zl-m^p) 

- {nhr^'H-%^'^a-\x)r^'\x)V^{x) 

X ~ 1) ~ ^^l24^(t){Zf}) 

+ 0{{l/^ + hf}, 
where $V_3(x) = E[\{Y - m(x)\}^3 \mid X = x]$.

Corollary 2. Assume the conditions of Theorem 3. // {m"(x)(z| 
3)}-Hz^03(4 - 1) - ^2^} < and {m"ix)iz} - 3)}-^{miz} - 1) 
3?o2-z|} < 0, then 

TM,r)^ lim I^M^) ^ M " /3| 



(3.15) 



n— >oo 



miii/j |P{m(x) G //j} - j3\ 
(3.16) 

■Z.03(4-l)-3^224l'/'r^02l^/3 



^^03(4 - 1) - 31^02^ ^^02 

Figure 2 plots $T_{0.95}(\delta, r)$ against $r$ and $\delta$ for the Uniform, Normal and
Epanechnikov kernels. Similarly to the mean squared error comparison,
$T_{0.95}(\delta, r)$ is always greater than or equal to 1, nondecreasing in $\delta$ both
for $r = \pm 1/\sqrt{2}$ and for the best $r$, and settles at an upper limit. Also,
$T_{0.95}(\delta, \pm 1/\sqrt{2})$ is very close to the optimal $\max_r T_{0.95}(\delta, r)$. Results for
other $\beta$ values are similar. The limit $\lim_{\delta\to\infty}\max_r T_\beta(\delta, r)$ is roughly 1.15 if
$\beta \ge 0.9$, 1.18 if $\beta = 0.85$, and 1.24 when $\beta = 0.75$ or 0.7. If $\beta = 0.75$ or 0.7,
$T_\beta(\delta, r)$ can be less than 1, but such small confidence levels are rarely used
in practice. For any fixed $\delta$, taking $r = \pm 1/\sqrt{2}$, as employed by $\hat m_\pm(x)$, the
coverage accuracy is always not far from optimal. Table 1 gives $T_\beta(\delta, \pm 1/\sqrt{2})$
for different $\beta$ and $\delta$ values. There are significant gains in terms of coverage
accuracy for $\delta \ge 1$ when using the Epanechnikov kernel.
Table 1
Coverage accuracy ratio

Kernel         beta    delta = 0.6    0.8      1.0      1.2      1.6      2.0
Uniform        0.95    1.035          1.047    1.060    1.080    1.116    1.139
               0.90    1.031          1.042    1.054    1.074    1.115    1.152
               0.85    1.027          1.037    1.047    1.067    1.113    1.167
               0.80    1.022          1.031    1.039    1.060    1.112    1.184
Epanechnikov   0.95    1.024          1.045    1.067    1.088    1.123    1.136
               0.90    1.022          1.045    1.072    1.099    1.139    1.151
               0.85    1.021          1.045    1.078    1.110    1.156    1.167
               0.80    1.019          1.045    1.084    1.124    1.177    1.185
Normal         0.95    1.001          1.003    1.006    1.011    1.027    1.047
               0.90    1.001          1.004    1.008    1.015    1.035    1.059
               0.85    1.002          1.004    1.010    1.019    1.042    1.072
               0.80    1.002          1.006    1.013    1.023    1.051    1.086

$T_\beta(\delta, \pm 1/\sqrt{2})$ for the Uniform, Epanechnikov and Normal kernels. From the top, the rows
within each kernel respectively represent $\beta$ = 0.95, 0.9, 0.85 and 0.8.

4. Implementation. 

4.1. Bandwidth selection. Bandwidth choice is most crucial in kernel
smoothing and dominates the performance. Compared to $\hat m(x)$, $\hat m_\pm(x)$ and
$\hat m_a(x)$ do not further complicate the bandwidth selection problem. This
property does not necessarily hold for other variance or bias reducing
modifications.

Fig. 2. Coverage accuracy relative efficiency. Top row: perspective plots of $T_{0.95}(\delta, r)$.
Bottom row: plotted against $\delta$ are $\arg\max_r T_{0.95}(\delta, r)$ (dotted), $1/\sqrt{2}$ (solid),
$\max_r T_{0.95}(\delta, r)$ (dotted-dashed) and $T_{0.95}(\delta, \pm 1/\sqrt{2})$ (dashed). From left, the columns
correspond to the Uniform, Epanechnikov and Normal kernels.

First, consider local bandwidths, which are needed when the underlying
population has sharp features in the regression, design or error distribution.
The optimal local bandwidth that minimizes the asymptotic mean squared
error (AMSE) of $\hat m(x)$ is

$$h_0(x) = \Bigl[\frac{\nu_{02}\,\sigma^2(x)}{\nu_{20}^2\,m''(x)^2 f(x)}\Bigr]^{1/5} n^{-1/5},$$

which gives the optimal AMSE

$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\nu_{02}^{4/5}n^{-4/5}. \tag{4.1}$$

For either of $\hat m_\pm(x)$, the optimal local bandwidth is

$$h_1(x) = \{\nu_{02} - C(\delta)/4\}^{1/5}\nu_{02}^{-1/5}h_0(x), \tag{4.2}$$

yielding the optimal AMSE

$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\{\nu_{02} - C(\delta)/4\}^{4/5}n^{-4/5}. \tag{4.3}$$

The bandwidth that minimizes the AMSE of $\hat m_a(x)$ is

$$h_a(x) = \{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{1/5}\nu_{02}^{-1/5}h_0(x), \tag{4.4}$$

giving the optimal AMSE

$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{4/5}n^{-4/5}. \tag{4.5}$$

An implication of (4.2) and (4.4) is that, for any given $\delta$ and $K$, after
adjusting by an appropriate constant multiplier, any local data-driven bandwidth
designed for $\hat m(x)$ is readily applicable to each of $\hat m_\pm(x)$ and $\hat m_a(x)$. In
addition, regardless of what the regression, design or error distribution is,
the multiplicative factors in (4.2) and (4.4) all remain the same for
different $x$ values. Automatic local bandwidth selectors for local linear regression
include those of [2, 10, 20].
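
The constant-factor adjustment in (4.2) and (4.4) amounts to a one-line computation once $\nu_{02}$, $C(\delta)$ and $D(\delta)$ are known for the chosen kernel and $\delta$. The following sketch uses hypothetical function names and assumes these three constants are supplied by the user; setting `d_delta` to zero gives the factor for $\hat m_\pm$, and a nonzero value gives the factor for $\hat m_a$.

```python
def bandwidth_factor(nu02, c_delta, d_delta=0.0):
    """Multiplier {nu02 - C/4 - D/2}^{1/5} * nu02^{-1/5} from (4.2) and (4.4)."""
    shrunk = nu02 - c_delta / 4.0 - d_delta / 2.0
    if nu02 <= 0.0 or shrunk <= 0.0:
        raise ValueError("variance constants must be positive")
    return (shrunk / nu02) ** 0.2

def adjusted_bandwidth(h0, nu02, c_delta, d_delta=0.0):
    """Rescale a bandwidth h0 selected for the local linear estimator."""
    return bandwidth_factor(nu02, c_delta, d_delta) * h0

if __name__ == "__main__":
    # Uniform kernel with delta >= 2: nu02 = 0.5 and C(delta) = 1.5 * nu02.
    print(adjusted_bandwidth(0.1, 0.5, 0.75))
```

Since the factor does not depend on $x$, the same multiplier can be applied to a whole vector of local bandwidths or to a single global bandwidth.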

Alternatively, for a generic kernel estimator $\tilde m$ of $m$, consider the optimal
global bandwidth that minimizes the asymptotic mean integrated squared
error (AMISE), that is, the first-order term in the mean integrated squared
error $E\int\{\tilde m(x) - m(x)\}^2 f(x)\,dx$. The optimal global bandwidths $h_0$, $h_1$ and
$h_a$ respectively minimizing the AMISE's of $\hat m$, $\hat m_\pm$ and $\hat m_a$ admit the
relations

$$h_1 = \{\nu_{02} - C(\delta)/4\}^{1/5}\nu_{02}^{-1/5}h_0, \tag{4.6}$$

$$h_a = \{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{1/5}\nu_{02}^{-1/5}h_0. \tag{4.7}$$

Again, any automatic global bandwidth for the local linear estimator (see,
e.g., [13, 21]) can be accordingly adjusted for implementation of $\hat m_\pm$ and
$\hat m_a$.
Note that the constant factors in (4.2) and (4.6) are the same, and those
in (4.4) and (4.7) are equal. Thus, bandwidth selection for our estimators
is very simple. This advantage arises from the fact that $r$ is kept fixed for
all $x$, thereby enhancing AMSE performance uniformly across different $x$,
regressions, designs and error distributions.

4.2. Bin width. In the construction of $\hat m_\pm(x)$ and $\hat m_a(x)$, the equally
spaced points $\alpha_{x,0}$, $\alpha_{x,1}$ and $\alpha_{x,2}$ have bin width $\omega_n = \delta h$ with $\delta > 0$. If $\delta = o(1)$, then there
is no variance reduction; see Remark 5. If $\delta$ diverges with $\delta = o(h^{-1/3})$, for
$\hat m_\pm(x)$, or $\delta = o(h^{-1/2})$, for $\hat m_a(x)$, although it is argued in Remark 5 that the
new estimators have good asymptotic properties, there can be adverse effects
on the finite sample biases. In applications caution is needed when employing
constant $\delta$. Each of $\hat m_\pm(x)$ and $\hat m_a(x)$ uses observations further away from
$x$ when using larger values of $\delta$ and that may substantially increase the
finite sample bias. In general, larger values of $\delta$, for example, $\delta \in [1.5, 2]$,
are recommended only when $n$ is large or when the estimation problem is
not difficult, that is, the regression function is smooth and the noise level
is low. Otherwise, a smaller $\delta$ is preferred. Furthermore, $\delta = 1$ is a good
default. Results of a numerical study, reported in Section 5, support these
suggestions.

4.3. Computation. A naive way to compute the proposed estimators is
to calculate the required local linear estimators and then form the linear
combinations. Then the computational effort is increased by some constant
factors. This extra burden can be avoided by a careful implementation.
Suppose the estimators are computed over an equispaced grid. Letting the
spacing of the grid be a multiple of $\omega_n$ reduces the number of kernel evaluations
to essentially the same as required by $\hat m$. In addition, fast implementation of
kernel estimators, for example, fast Fourier transform or binning methods of
Fan and Marron [12], can be employed to alleviate the computational effort.

5. Numerical performance. A simulation study was carried out to
investigate the finite sample performances. Three regression functions (see [24]),

1. bimodal, $m(x) = 0.3\exp\{-16(x - 0.25)^2\} + 0.7\exp\{-64(x - 0.75)^2\}$,

2. linear with peak, $m(x) = 2 - 5x + 5\exp\{-400(x - 0.5)^2\}$,

3. sine, $m(x) = \sin(5\pi x)$,

were considered. The design was Uniform(0, 1). The random error $\varepsilon$ was
Normal(0, 1) distributed and $\sigma(x) = k\sigma_0$ for all $x \in [0, 1]$, where $k$ = 0.5, 1 or
2 and $\sigma_0 = 0.1$, $\sqrt{0.5}$ and 0.5 for the bimodal, linear with peak and sine
regression functions, respectively. The sample size was 25, 50, 100, 250 or 500.
To avoid the data sparseness problem, let $\hat m_{HT}(x)$ be the local linear
estimator employing the interpolation method of Hall and Turlach [15] with the
parameter $r$ therein being taken as 3. Let $\hat m_{CH}(x)$ denote the bias
reduction method of Choi and Hall [7] applied to $\hat m_{HT}(x)$ rather than the local
linear estimator. Our variance reduction methods, using $\delta$ = 0.6, 0.8, 1, 1.2
or 1.6, were applied to $\hat m_{HT}(x)$. The Epanechnikov kernel was employed.
The bandwidth $h$ ranged over $\{0.008 \cdot 1.1^k, k = 0, 1, \ldots, 40\}$. In each setting
1000 samples were simulated. The mean integrated squared error (MISE)
of an estimator is approximated by the average of the 1000 numerical
integrals of the squared errors. The integrated squared bias (ISB) and integrated
variance (IV) are likewise approximated.
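
For readers who wish to set up a comparable experiment, the simulation settings described above can be written down as follows. This is only a sketch of the data-generating side: the names (`bimodal`, `linear_with_peak`, `sine`, `SIGMA0`, `BANDWIDTHS`, `simulate`) are hypothetical, the random number generator and seed are arbitrary, and the estimators of Hall and Turlach [15] and Choi and Hall [7] are not implemented here.

```python
import math
import random

def bimodal(x):
    return 0.3 * math.exp(-16 * (x - 0.25) ** 2) + 0.7 * math.exp(-64 * (x - 0.75) ** 2)

def linear_with_peak(x):
    return 2 - 5 * x + 5 * math.exp(-400 * (x - 0.5) ** 2)

def sine(x):
    return math.sin(5 * math.pi * x)

# Baseline noise levels sigma_0 for the three regression functions.
SIGMA0 = {"bimodal": 0.1, "linear_with_peak": math.sqrt(0.5), "sine": 0.5}

# Bandwidth grid {0.008 * 1.1^k, k = 0, ..., 40}.
BANDWIDTHS = [0.008 * 1.1 ** k for k in range(41)]

SAMPLE_SIZES = (25, 50, 100, 250, 500)
NOISE_MULTIPLIERS = (0.5, 1.0, 2.0)

def simulate(m, sigma0, k, n, rng):
    """Draw n pairs with Uniform(0, 1) design and Normal(0, 1) errors scaled by k * sigma0."""
    xs = [rng.random() for _ in range(n)]
    ys = [m(x) + k * sigma0 * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed, for reproducibility only
    xs, ys = simulate(bimodal, SIGMA0["bimodal"], 1.0, 100, rng)
    print(len(xs), min(xs), max(xs))
```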

Figure 3 depicts the MISE, ISB and IV curves under one of the
configurations. Our estimator $\hat m_a$ significantly improves on $\hat m_{HT}$ in terms of IV.
The second-order bias effect of $\hat m_a$ is more apparent when $h$ is large or $\delta$ is
large. The ISB curve of $\hat m_{CH}$ is much smaller than that of $\hat m_{HT}$ except when
$h$ is large.

Finite sample efficiency of an estimator relative to $\hat m_{HT}$ can be measured
by the ratio of the minimal, over $h$, MISE values. Figure 4 plots some of the
results. Our estimator $\hat m_a$, using any of the $\delta$ values, outperforms $\hat m_{HT}$ most
of the time. The bias reduction estimator $\hat m_{CH}$ is better than $\hat m_a$ with $\delta \le 1$
when the regression function is smooth and when $n \ge 100$. Interestingly, even
when $n = 500$, $\hat m_{CH}$ has no significant advantage over $\hat m_a$ with $\delta$ = 1.2 or 1.6.
Although we do not report the simulation results on the linear with peak
regression function, we observe that, in this case, $\hat m_{CH}$ performs roughly the
same as $\hat m_{HT}$, but our variance reduction estimator $\hat m_a$ outperforms $\hat m_{HT}$.
These observations coincide with previous studies on bias reduction methods
that they usually depend on certain higher-order approximations which take
effect only when the curve is smooth and $n$ is large. In the high-noise case
($k = 2$), $\hat m_{HT}$ breaks down and there is not much improvement employing
any modification.

Examining Figure 4 more closely, we have the following conclusions. First,
$\delta = 1$ is a reliable choice for general purposes. Except when the noise level
is high ($k = 2$), the relative efficiency for $\delta = 1$ is already above 1.1 from
$n = 100$ and is not far from the asymptotic value $\gamma_a(1) \approx 1.22$ (see Figure
1) for moderate sample sizes. In all of the cases, the curves for $\delta$ = 0.6
and 0.8 become almost flat starting from $n = 100$. Besides $\delta = 1$, larger $\delta$
values, for example, $\delta = 1.2$, may have potential in practice. In general, for
smooth curves like the sine function or low noise levels, gains resulting from
employing $\delta > 1$ over using $\delta \le 1$ become noticeable for $n \ge 500$.

Fig. 3. MISE, ISB and IV curves. From left to right are MISE, ISB and IV plotted
against $h$ when the regression is bimodal, the design is U(0, 1), $k = 1$ and $n = 100$. The
line types solid, long-dashed, dotted, short-dashed, dashed, dotted-spaced and dotted-dashed
respectively represent $\hat m_{HT}$, $\hat m_{CH}$ and $\hat m_a$ with $\delta$ = 0.6, 0.8, 1, 1.2 and 1.6.

Fig. 4. MISE relative efficiency. Top (or bottom) row plots against $n$ the MISE
efficiencies relative to $\hat m_{HT}$ when regression is the bimodal (or sine) function and design is
U(0, 1). From left, the columns correspond to the noise levels $k$ = 0.5, 1 and 2. Line types
represent the estimators and are as in Figure 3.

Although the simulation results of $\hat m_\pm$ are not given here, the
performances of $\hat m_\pm$ are hard to differentiate from each other, the average version
$\hat m_a$ copes with the second-order bias effect better than $\hat m_\pm$, and $\hat m_\pm(x)$
using $\delta \le 1.2$ outperform $\hat m_{HT}$ most of the time. The simulation study has
been summarized in terms of the MISE performances. The pointwise mean
squared errors, not reported here, provide similar conclusions. In addition,
the MSE relative efficiencies of the proposed estimators are roughly
unchanged as $x$ varies. Two truncated Normal designs, $N(0.5, 0.5^2) \cap (0, 1)$
and $N(0, 1) \cap (0, 1)$, were experimented with under all the considered
configurations as well. In all cases, the relative MISE efficiency of each of our
variance reduction estimators behaves consistently across the three different
designs.
6. Generalizations and applications. Generalizations of the methodology
are multifold and comparatively easy. First, in kernel estimation of curves 
and their derivatives using higher-order kernels or by fitting higher-degree lo- 
cal polynomials, a linear combination of some preliminary estimators at more 
than three neighboring points is required in order that moment conditions, 
in the same spirit as those in (7.4), are satisfied. See also the asymptotic 
mean expressions (7.3) and (7.5). Second, extension to estimation of multi- 
dimensional surfaces is possible. Cheng and Peng [6] made some progress in 
this regard: when using product kernels, form a grid in every direction as in 
the one-dimensional case, and then take the coefficient of one preliminary es- 
timator as the product of all the corresponding one-dimensional coefficients. 
Asymptotic relative efficiency, in terms of MSE or MISE, compared to the 
local linear smoother is jqiS)"^ and is 7a (<5)'^ for the average version. 

The proposed variance reduction strategy is fairly simple, allowing a wide 
range of potential applications. Variance reduction is particularly useful 
when very few data points can be used in the estimation, for example, es- 
timating quantities in conditional distributions. Other useful applications 
include various local modeling techniques such as local likelihood estima- 
tion, varying coefficient models and hazard regression. In situations where 
the covariance structure of the preliminary estimators is completely anal- 
ogous to (7.6), direct application is appropriate. Examples include kernel 
density estimation and are found in local likelihood modeling [19, 26]. Notice 
that applying the techniques to density estimation can introduce negativity. 
When applying to more complicated settings, the covariance structure of 
the preliminary estimators needs to be investigated. 

7. Proofs. 

Proof of Theorem 1. Following [9], consider the version of the local
linear estimator in (3.1). Denote $Z_n = O_4(a_n)$ if $E|Z_n|^4 = O(a_n^4)$ and
$Z_n = o_4(a_n)$ if $E|Z_n|^4 = o(a_n^4)$. For any fixed
$z \in [\alpha_{x,0}, \alpha_{x,2}]$, write
$\omega_{i,z} = hK_h(z - X_i)\{S_{n,2}(z) - (z - X_i)S_{n,1}(z)\}$ with
$S_{n,j}(z) = h\sum_{i=1}^n (z - X_i)^j K_h(z - X_i)$,
$j = 0, 1, 2$. Then one can show that

$$n^2h^4\Bigl\{\sum_{i=1}^n \omega_{i,z} + n^{-2}\Bigr\}^{-1} = \{\nu_{20}f^2(z)\}^{-1} + o_4(1), \tag{7.1}$$

$$\sum_{i=1}^n \omega_{i,z}\{m(X_i) - m(z)\} = n^2h^6 f(z)\nu_{20}s_{n,z} + o_4(n^2h^6), \tag{7.2}$$

where $s_{n,z} = h^{-2}E\{m(X) - m(z) - m'(z)(X - z)\}K_h(z - X)$. From (7.1)
and (7.2),

$$E\{\hat m(z)\} = m(z) + \tfrac{1}{2}m''(z)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}. \tag{7.3}$$
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Note that, for j = 0,1, 2, 

(7.4) ^o(r)(-l - ry + Ai{r){-ry + ^2(r)(l - = (^oj, 

where (5oj- = 1 if j = and otherwise. Recall that x = ax^i + rojn G [a^.O; aa;,2] > 
-1 < r < 1, and uJn = Sh = 0^,2 - "x,! = "x,! - aa;,o- Using (7.3), Taylor ex- 
pansion and (7.4) we have 

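$$\begin{aligned}
E\{\hat m_q(x)\} &= \sum_{i=0,1,2} A_i(r)E\{\hat m(\alpha_{x,i})\} \\
&= \{m(x) + \tfrac12 m''(x)\nu_{20}h^2\}\sum_{i=0,1,2} A_i(r) + m'(x)\sum_{i=0,1,2} A_i(r)(\alpha_{x,i} - x) \\
&\quad + \tfrac12 m''(x)\sum_{i=0,1,2} A_i(r)(\alpha_{x,i} - x)^2 + o\{h^2 + (nh)^{-1/2}\} \\
&= m(x) + \tfrac12 m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}.
\end{aligned} \tag{7.5}$$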
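The moment conditions (7.4) determine the three coefficients uniquely, and the step from (7.3) to (7.5) only uses those conditions. The following is a minimal numerical sketch of that fact, not code from the paper. The function names, the closed form in `closed_form` (obtained here by solving (7.4) by hand) and the test values of `r`, `alpha1` and `omega` are illustrative assumptions.

```python
import numpy as np

def coefficients(r):
    """Solve the three moment conditions (7.4) for A_0(r), A_1(r), A_2(r)."""
    # Scaled offsets (alpha_{x,i} - x) / (delta * h) for i = 0, 1, 2.
    nodes = np.array([-1.0 - r, -r, 1.0 - r])
    # Row j of V holds nodes**j, so V @ A = (1, 0, 0) encodes (7.4).
    V = np.vander(nodes, 3, increasing=True).T
    return np.linalg.solve(V, np.array([1.0, 0.0, 0.0]))

def closed_form(r):
    """Closed form implied by (7.4); derived for this sketch, not quoted."""
    return np.array([r * (r - 1) / 2, 1 - r**2, r * (r + 1) / 2])

for r in (-0.5, 0.0, 0.3, 0.5):
    assert np.allclose(coefficients(r), closed_form(r))

# Consequence used in (7.5): the combination reproduces any quadratic exactly,
# so the first- and second-order Taylor terms cancel.
m = lambda t: 2.0 - t + 3.0 * t**2          # arbitrary quadratic test function
alpha1, omega, r = 0.4, 0.1, 0.3            # hypothetical grid centre and spacing
alphas = alpha1 + np.array([-1.0, 0.0, 1.0]) * omega
x = alpha1 + r * omega
assert np.isclose(coefficients(r) @ m(alphas), m(x))
```

Some coefficients are negative for every $r \neq 0$ in $(-1, 1)$, which is consistent with the remark that the combination can yield negative values when applied to density estimates.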

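Next we compute the variance of $\hat m_q(x)$. For any $u, v \in [\alpha_{x,0}, \alpha_{x,2}]$, from (7.1) 
and (7.3), 

$$\mathrm{Cov}\{\hat m(u), \hat m(v)\} = \mathrm{Cov}\bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\{Y_i - m(u)\}}{\sum_{i=1}^{n}\omega_{i,u} + n^{-2}},\ \frac{\sum_{i=1}^{n}\omega_{i,v}\{Y_i - m(v)\}}{\sum_{i=1}^{n}\omega_{i,v} + n^{-2}}\bigg] + o\Big(\frac{1}{nh}\Big).$$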

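Taking conditional expectations on $X_1, \ldots, X_n$ and using mean and variance 
decomposition yield 

$$\begin{aligned}
\mathrm{Cov}\{\hat m(u), \hat m(v)\}
&= E\bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\{m(X_i) - m(u)\}\sum_{j=1}^{n}\omega_{j,v}\{m(X_j) - m(v)\}}{(\sum_{i=1}^{n}\omega_{i,u} + n^{-2})(\sum_{j=1}^{n}\omega_{j,v} + n^{-2})}\bigg] \\
&\quad + E\bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\omega_{i,v}\sigma^2(X_i)}{(\sum_{i=1}^{n}\omega_{i,u} + n^{-2})(\sum_{j=1}^{n}\omega_{j,v} + n^{-2})}\bigg] \\
&\quad - E\bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\{m(X_i) - m(u)\}}{\sum_{i=1}^{n}\omega_{i,u} + n^{-2}}\bigg]E\bigg[\frac{\sum_{j=1}^{n}\omega_{j,v}\{m(X_j) - m(v)\}}{\sum_{j=1}^{n}\omega_{j,v} + n^{-2}}\bigg] + o\{(nh)^{-1}\} \\
&= \frac{\sigma^2(x)}{nf(x)}\int K_h(u - t)K_h(v - t)\,dt + o\{h^4 + (nh)^{-1}\},
\end{aligned} \tag{7.6}$$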

where the last equality follows from (7.1), (7.2) and 
$\sum_{i=1}^{n}\omega_{i,u}\omega_{i,v}\sigma^2(X_i) = n^3h^8\sigma^2(x)\nu_{20}^2f^2(x)\int K_h(u - t)K_h(v - t)f(t)\,dt\,\{1 + O_4(h + (nh)^{-1/2})\}$. 
Then (3.5) is valid since 

$$\mathrm{Var}\{\hat m_q(x)\} = \sum_{i=0,1,2} A_i(r)^2\,\mathrm{Var}\{\hat m(\alpha_{x,i})\} + \sum_{i=0,1,2}\sum_{j \neq i} A_i(r)A_j(r)\,\mathrm{Cov}\{\hat m(\alpha_{x,i}), \hat m(\alpha_{x,j})\}.$$

□ 
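The variance of the linear combination is a quadratic form in the coefficients, with a matrix given by the leading term of (7.6) evaluated at the three grid points. The sketch below evaluates that quadratic form in units of $\sigma^2(x)/\{nhf(x)\}$ and compares it with the expression $\int K^2 - r^2(1 - r^2)C(\delta)$, where $C(\delta)$ is the constant in (7.7). It makes several assumptions that are not taken from the paper: a standard Gaussian kernel (chosen because its overlap integrals have a closed form), the coefficients obtained by solving (7.4), and hypothetical function names. Whether the closed form coincides with the display (3.5) is not checked here.

```python
import numpy as np

def coefficients(r):
    """Coefficients solving the moment conditions (7.4)."""
    return np.array([r * (r - 1) / 2, 1 - r**2, r * (r + 1) / 2])

def kernel_overlap(lag):
    """int K(s) K(s + lag) ds for the standard Gaussian kernel (assumed here)."""
    return np.exp(-np.asarray(lag, dtype=float) ** 2 / 4) / (2 * np.sqrt(np.pi))

def variance_factor(r, delta):
    """Quadratic form A' G A, where G[i, j] is the overlap at lag (i - j) * delta.

    By (7.6), Cov{m(alpha_i), m(alpha_j)} is approximately
    sigma^2 / (n h f) * G[i, j], so this is the variance in those units.
    """
    A = coefficients(r)
    idx = np.arange(3)
    G = kernel_overlap((idx[:, None] - idx[None, :]) * delta)
    return A @ G @ A

def variance_factor_closed(r, delta):
    """int K^2 - r^2 (1 - r^2) C(delta), with C(delta) expanded as in (7.7)."""
    g0, g1, g2 = (kernel_overlap(k * delta) for k in (0, 1, 2))
    C = 1.5 * g0 - 2 * g1 + 0.5 * g2
    return g0 - r**2 * (1 - r**2) * C

for r in (-0.5, 0.2, 0.5):
    for delta in (0.5, 1.0, 2.0):
        assert np.isclose(variance_factor(r, delta), variance_factor_closed(r, delta))
```

Under these assumptions the factor equals $\int K^2$ at $r = 0$, where the combination reduces to the preliminary estimator, and is strictly smaller for $0 < |r| < 1$.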

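Proof of Proposition 1. Property (a) follows from 
$\int K(x - \delta/2)K(x + \delta/2)\,dx = \int K(x)K(x - \delta)\,dx = \int K(x)K(x + \delta)\,dx$ and writing 

$$\begin{aligned}
C(\delta) &= \int \{\tfrac32 K(x)^2 - K(x)K(x + \delta) - K(x)K(x - \delta) + \tfrac12 K(x - \delta)K(x + \delta)\}\,dx \\
&= \int \{K(x) - \tfrac12 K(x + \delta) - \tfrac12 K(x - \delta)\}^2\,dx.
\end{aligned} \tag{7.7}$$

Property (b) can be shown by differentiating the right-hand side of (7.7). 
□ 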
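The two expressions for $C(\delta)$ in (7.7) can be compared numerically. The sketch below does this with a plain Riemann sum. The Epanechnikov kernel, the grid and the function names are assumptions made for illustration, and any kernel with support inside the grid after shifting by $\delta$ would serve equally well.

```python
import numpy as np

GRID = np.linspace(-6.0, 6.0, 240001)   # covers [-1 - delta, 1 + delta] for delta <= 5
DX = GRID[1] - GRID[0]

def epanechnikov(x):
    """Epanechnikov kernel 0.75 * (1 - x^2) on [-1, 1] (assumed for this check)."""
    return np.where(np.abs(x) <= 1, 0.75 * (1 - x**2), 0.0)

def C_expanded(delta, K=epanechnikov):
    """First line of (7.7): the expanded integrand."""
    k0, kp, km = K(GRID), K(GRID + delta), K(GRID - delta)
    return np.sum(1.5 * k0**2 - k0 * kp - k0 * km + 0.5 * km * kp) * DX

def C_square(delta, K=epanechnikov):
    """Second line of (7.7): the integral of a perfect square."""
    k0, kp, km = K(GRID), K(GRID + delta), K(GRID - delta)
    return np.sum((k0 - 0.5 * kp - 0.5 * km) ** 2) * DX

for delta in (0.0, 0.5, 1.0, 1.5, 2.0):
    assert np.isclose(C_expanded(delta), C_square(delta), atol=1e-6)
```

The second form makes it evident that $C(\delta) \ge 0$, with $C(0) = 0$. For this compactly supported kernel the cross terms vanish once $\delta \ge 2$, so $C(\delta)$ then equals $\tfrac32\int K^2$.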

Proof of Proposition 2. From (3.7) and (3.9), 



D{6) - '"^^(^ 



- Var{m+(x)} + - Var{?n_(x)} — Var{?na(x)} 



a'^{x) 

X {1 + 0(1)} 

= Var{m+(x) - m„(x)}{l + o(l)}. ^ 

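Proof of Theorem 3. Let $\mu_{ijk}(x) = E\{w_{ijk}(x)\}$, $\Delta_{ijk}(x) = w_{ijk}(x) - \mu_{ijk}(x)$, 
$\mu^*(x) = \mu_{210}(x)\mu_{011}(x) - \mu_{110}(x)\mu_{111}(x)$, $U_1(x) = w_{210}(x)w_{011}(x) - w_{110}(x)w_{111}(x)$ 
and $U_2(x) = w_{210}(x)w_{010}(x) - w_{110}^2(x)$. Note that 

$$\begin{aligned}
U_1(x) &= \Delta_{210}(x)\Delta_{011}(x) - \Delta_{110}(x)\Delta_{111}(x) + \mu_{011}(x)\Delta_{210}(x) \\
&\quad + \mu_{210}(x)\Delta_{011}(x) - \mu_{110}(x)\Delta_{111}(x) - \mu_{111}(x)\Delta_{110}(x) + \mu^*(x), \\
U_2(x) &= \Delta_{210}(x)\Delta_{010}(x) - \Delta_{110}^2(x) - 2\mu_{110}(x)\Delta_{110}(x) - \mu_{110}^2(x) \\
&\quad + \mu_{210}(x)\Delta_{010}(x) + \mu_{010}(x)\Delta_{210}(x) + \mu_{210}(x)\mu_{010}(x),
\end{aligned}$$

where $\Delta_{210}(x)$, $\Delta_{011}(x)$, $\Delta_{110}(x)$, $\Delta_{111}(x)$, $\Delta_{010}(x)$ and $\Delta_{012}(x)$ are all 
$O_p\{(nh)^{-1/2}\}$ and $\mu_{210}(x) = O(1)$, $\mu^*(x) = O(h^2)$, $\mu_{110}(x) = O(h)$, 
$\mu_{111}(x) = O(h)$, $\mu_{010}(x) = O(1)$ and $\mu_{012}(x) = O(1)$. Then 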



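$$\begin{aligned}
T_n(x) &= \frac{\sqrt{nh}\,U_1(x)}{\nu_{02}^{1/2}U_2(x)}\bigg[w_{010}^{-2}(x)\bigg\{w_{012}(x) - 2w_{011}(x)\frac{U_1(x)}{U_2(x)} + w_{010}(x)\frac{U_1^2(x)}{U_2^2(x)}\bigg\}\bigg]^{-1/2} \\
&= \sqrt{nh}\,\nu_{02}^{-1/2}\,U_1(x)w_{010}(x) \\
&\quad \times \{w_{012}(x)U_2^2(x) - 2w_{011}(x)U_1(x)U_2(x) + w_{010}(x)U_1^2(x)\}^{-1/2} \\
&= \sqrt{nh}\,\nu_{02}^{-1/2}\{U_1(x)\Delta_{010}(x) + \mu_{010}(x)U_1(x)\} \\
&\quad \times \{U_2^2(x)\Delta_{012}(x) + \mu_{012}(x)U_2^2(x) - 2\Delta_{011}(x)U_1(x)U_2(x) - 2\mu_{011}(x)U_1(x)U_2(x) \\
&\qquad + \Delta_{010}(x)U_1^2(x) + \mu_{010}(x)U_1^2(x)\}^{-1/2} \\
&= \sqrt{nh}\,\nu_{02}^{-1/2}[\mu_{210}(x)\Delta_{011}(x)\Delta_{010}(x) + \mu_{010}(x)\Delta_{210}(x)\Delta_{011}(x) \\
&\qquad - \mu_{010}(x)\Delta_{110}(x)\Delta_{111}(x) + \mu_{010}(x)\mu_{210}(x)\Delta_{011}(x) \\
&\qquad - \mu_{010}(x)\mu_{110}(x)\Delta_{111}(x) - \mu_{010}(x)\mu_{111}(x)\Delta_{110}(x) \\
&\qquad + \mu_{010}(x)\mu^*(x) + O_p(\{(nh)^{-1/2} + h\}^3)] \\
&\quad \times [\mu_{210}^2(x)\mu_{010}^2(x)\mu_{012}(x) + \mu_{210}^2(x)\mu_{010}^2(x)\Delta_{012}(x) \\
&\qquad + 2\mu_{210}^2(x)\mu_{010}(x)\mu_{012}(x)\Delta_{010}(x) \\
&\qquad + 2\mu_{210}(x)\mu_{010}^2(x)\mu_{012}(x)\Delta_{210}(x) \\
&\qquad + O_p(\{(nh)^{-1/2} + h\}^2)]^{-1/2} \\
&= T_{n1}(x) + O_p(\{(nh)^{-1/2} + h\}^2),
\end{aligned}$$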

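where 

$$\begin{aligned}
T_{n1}(x) &= \{\mu_{210}^2(x)\mu_{010}^2(x)\mu_{012}(x)\}^{-1/2}\nu_{02}^{-1/2} \\
&\quad \times \sqrt{nh}\,\{\mu_{210}(x)\Delta_{011}(x)\Delta_{010}(x) + \mu_{010}(x)\Delta_{210}(x)\Delta_{011}(x) \\
&\qquad - \mu_{010}(x)\Delta_{110}(x)\Delta_{111}(x) + \mu_{010}(x)\mu_{210}(x)\Delta_{011}(x) \\
&\qquad - \mu_{010}(x)\mu_{110}(x)\Delta_{111}(x) - \mu_{010}(x)\mu_{111}(x)\Delta_{110}(x) \\
&\qquad + \mu_{010}(x)\mu^*(x)\} \\
&\quad - 2^{-1}\{\mu_{210}^2(x)\mu_{010}^2(x)\mu_{012}(x)\}^{-3/2}\sqrt{nh}\,\nu_{02}^{-1/2}\mu_{210}^2(x)\mu_{010}^2(x) \\
&\quad \times \{\mu_{210}(x)\mu_{010}(x)\Delta_{011}(x)\Delta_{012}(x) \\
&\qquad + 2\mu_{210}(x)\mu_{012}(x)\Delta_{011}(x)\Delta_{010}(x) \\
&\qquad + 2\mu_{010}(x)\mu_{012}(x)\Delta_{011}(x)\Delta_{210}(x)\}.
\end{aligned}$$

We have 

$E\{T_{n1}(x)\}$ 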

'^M/^2io(3^)/^oio(2;)^oi2(a;)}"^''^j^o2^^Voio(a^)^f*(2;) 
2 ^{nh) ^^^{fJ-lio{x)fJ-lio{x)noi2ix)} 

X l'02^'^^J'21o{x)^iow{x)m3{x) 



+ 0({(n/i)"i/2 + /i}2) 

- 2~Hnhr'/^r'/\x)uli'a~Hx)V3{x) + 0{{{nhr'/^ + h}^), 
EM,ix)} = l + Oi{inh)-'/' + h}% 
E{Tniix)} = {nh)-^^^{fil^oix)fil^o{x)fioi2{x)uo2}-^^^ 

X {/-fo33(2;) - ^m2ix)m3ix)/m2ix)} + 0({(n/l)"^/2 + 

+ Oi{inh)-^/^ + h}^), 
E{T^i{x)} = 0({(n/i)-i/2 ^ ;^|2~) ; > 4^ 



Hence, by Edgeworth expansion (see, e.g., Chapter 2 of [14]), 



P{Tnl{x)<z} 

= $(z) + {nh^)^/H-^iy2iu~2^^^a~\x)f/\x)m"{x){z'^ - 3)(/.(z) 

- inhy'/^6-'u-i/'a-Hx)r'^\x)V3{x){uo3iz^ - 1) - 3ui,z^}<P{z) 
+ 0({(n/i)-V2+/,}2)^ 



and then applying the delta method yields (3.14). To calculate the coverage 
probability of I/j, write 





E Mr) \JvQ2/T^02Tn{ax,l) 



«=0,1,2 
m{ax,i) -m{x) 



+ 



Since 



we have 



where 



Tn{ax,i) = T„i(a,,;) + Op{{{nhr'/^ + h}^), 

X VnJii^o2^'^f^oio{ax,i)l-i2io{ax,i)^oii{ax,i) 
X [l + Op{{nh)-^/^ + h}], 

T:{x) = T:,{x) + Op({(n/i)"V2 ^ ^Y), 



«=0,1,2 

-2~Voi2(2;)/ioio(2;) 
X {m2(ax,/)^oio("a;,«) " Mi2 (a:^)/ioio (2;)} 

X Vn/iz?o2^^^;Uoio(ax,/)mo(ax,«)^oii(aa;,0]- 
Then the following equalities and the delta method yield (3.15). 

-2-^f{xr^/^ul!,\-\x)Vi{x){nhr^l^ 
+ 0({(n/i)-i/2 + /i}2), 
E{T;i(x)2} = l + 0({(n/i)-i/2 + /i}2), 

HKlixf} = K)-l/2/(x)-l/V-3(x)?o2'/V3(x){z?03 " 2-l9z?o'2} 

+ 0{{{nhy^)^ + hf), 
E{7;*i(x)'} = 0({(n/i)-i/2^/i}2) for/>4. □ 